Projekt Description¶

The business model is a platform (crowd investing) where people who have a business idea, but do not have money they need, can register and raise funds for their project within a given time.

On the other hand, there are donors (creditors or investors) who would like to invest their money in projects and are looking for investments' opportunities.

The platform brings together borrowers and lenders as an intermediary.

Database contains historic data of the platform.

Additional assumptions

All projects are completed, i.e. the time to raise money for each project has expired. In the business model all the collected funds are paid out, even if the target amount has not been reached.

The platform earns money as a commission for each project that is placed with the help of the platform.

The lender receives an interest on the loan.

CRISP-DM: Business Understanding¶

Description of data fields:

  • funded_amount ... Amount received/ paid out in USD at the end of the "crowding" period
  • loan_amount ... Target amount (amount that a project wishes to achieve) in USD
  • activity ... Subcategory to which the goal of the crowdproject thematically belongs
  • sector ... Top category in which the crowdproject topic falls
  • use ... Brief description of what the money should be used for
  • country_code ... Country code according to ISO standard
  • country ... Country name according to ISO standard
  • region ... Region within the correspondent country
  • currency ... Currency in which the funded_amount was then paid out
  • term in months ... Term of repayment the loan
  • lender_count ...Lender (i.e. how many people placed their money in the project)
  • borrower_genders ... Gender and number of borrowers, i.e. those who initiated the crowdproject
  • repayment interval ... Contractually agreed repayment style

CRISP-DM: Data Preparation¶

In [1]:
# Importing the necessary libraries 
import pandas as pd         
import numpy as np 
import plotly.express as px

Data Reading¶

In [2]:
# Data first reading to understand what separator is used  
df_input = pd.read_csv("data_abschlussprojekt.csv", engine='python', nrows=2)
df_input

# Separator is # and the first column is index
Out[2]:
# funded_amount# loan_amount# activity# sector# use# country_code# country# region# currency# term_in_months# lender_count# borrower_genders# repayment_interval
0#300.0#300.0#Fruits & Vegetables#Food#To buy seasonal fresh fruits to sell. #PK#Pakistan#Lahore#PKR...
1#575.0#575.0#Rickshaw#Transportation#to repair and maintain the auto rickshaw used in their business.#PK#Pakistan#Lahore#PKR#11.0#14#female female#irregular
In [3]:
# Reading with correct separator and delete index column
df_input = pd.read_csv("data_abschlussprojekt.csv", sep='#', index_col = 0, engine='python')
df_input.head(2)
Out[3]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular

Exploring Data¶

Getting to know the Data¶

General Information about Data

In [4]:
df_input.shape
# data has 13 columns and 671205 rows
Out[4]:
(671205, 13)
In [5]:
df_input.info()
# types of data, number of not-null data in each column, number of columns of each type, memory usage
<class 'pandas.core.frame.DataFrame'>
Int64Index: 671205 entries, 0 to 671204
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0    funded_amount       671205 non-null  float64
 1    loan_amount         671205 non-null  float64
 2    activity            671205 non-null  object 
 3    sector              671205 non-null  object 
 4    use                 666973 non-null  object 
 5    country_code        671197 non-null  object 
 6    country             671205 non-null  object 
 7    region              614405 non-null  object 
 8    currency            671205 non-null  object 
 9    term_in_months      671205 non-null  float64
 10   lender_count        671205 non-null  int64  
 11   borrower_genders    666984 non-null  object 
 12   repayment_interval  671205 non-null  object 
dtypes: float64(3), int64(1), object(9)
memory usage: 71.7+ MB

Evaluation

Numeric data have numeric-type, string data have object-type

General statistical evaluation of data

In [6]:
df_input.describe()
Out[6]:
funded_amount loan_amount term_in_months lender_count
count 671205.000000 671205.000000 671205.000000 671205.000000
mean 785.995061 842.397107 13.739022 20.590922
std 1130.398941 1198.660073 8.598919 28.459551
min 0.000000 25.000000 1.000000 0.000000
25% 250.000000 275.000000 8.000000 7.000000
50% 450.000000 500.000000 13.000000 13.000000
75% 900.000000 1000.000000 14.000000 24.000000
max 100000.000000 100000.000000 158.000000 2986.000000

Evaluation results

Min -values seems to be correct, Max -values are located too far from median as well as from average, it is necessary to look carefully at the transactions with such a loan size, term and number of lenders

Checking and Correcting Column Names, Values, Data Formatting

In [7]:
df_input.columns

#column names have spaces at the beginning. Deleting with lstrip
df_input.columns= df_input.columns.str.lstrip()
df_input.columns
Out[7]:
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
       'country_code', 'country', 'region', 'currency', 'term_in_months',
       'lender_count', 'borrower_genders', 'repayment_interval'],
      dtype='object')
In [8]:
# get the unique values for each column with 'object'-type to check the data
object_liste = df_input.select_dtypes(include='object').columns

for element in object_liste:
    print(element,'\n', df_input[element].unique())
activity 
 ['Fruits & Vegetables' 'Rickshaw' 'Transportation' 'Embroidery'
 'Milk Sales' 'Services' 'Dairy' 'Beauty Salon' 'Manufacturing'
 'Food Production/Sales' 'Wholesale' 'General Store' 'Clothing Sales'
 'Poultry' 'Tailoring' 'Sewing' 'Bakery' 'Restaurant' 'Food Stall'
 'Farming' 'Construction Supplies' 'Personal Products Sales'
 'Home Products Sales' 'Natural Medicines' 'Fish Selling'
 'Education provider' 'Shoe Sales' 'Machinery Rental' 'Butcher Shop'
 'Pigs' 'Personal Expenses' 'Food Market' 'Cosmetics Sales'
 'Personal Housing Expenses' 'Retail' 'Energy' 'Grocery Store'
 'Construction' 'Agriculture' 'Motorcycle Transport' 'Charcoal Sales'
 'Food' 'Pharmacy' 'Fishing' 'Timber Sales' 'Cattle' 'Electronics Repair'
 'Electronics Sales' 'Vehicle' 'Cafe' 'Blacksmith'
 'Higher education costs' 'Used Clothing' 'Fuel/Firewood' 'Upholstery'
 'Catering' 'Animal Sales' 'Cereals' 'Vehicle Repairs' 'Arts'
 'Cloth & Dressmaking Supplies' 'Mobile Phones' 'Spare Parts' 'Clothing'
 'Metal Shop' 'Barber Shop' 'Furniture Making' 'Crafts' 'Home Energy'
 'Home Appliances' 'Wedding Expenses' 'Taxi' 'Secretarial Services'
 'Livestock' 'Property' 'Recycling' 'Farm Supplies' 'Auto Repair'
 'Beverages' 'Plastics Sales' 'Electrical Goods' 'Carpentry' 'Photography'
 'Jewelry' 'Bricks' 'Pub' 'Phone Use Sales' 'Water Distribution'
 'Paper Sales' 'Computers' 'Liquor Store / Off-License' 'Utilities'
 'Knitting' 'Weaving' 'Party Supplies' 'Medical Clinic' 'Internet Cafe'
 'Consumer Goods' 'Cement' 'Electrician' 'Primary/secondary school costs'
 'Veterinary Sales' 'Land Rental' 'Laundry' 'Call Center' 'Perfumes'
 'Hotel' 'Motorcycle Repair' 'Movie Tapes & DVDs' 'Quarrying'
 'Personal Medical Expenses' 'Bookstore' 'Decorations Sales'
 'Recycled Materials' 'Office Supplies' 'Souvenir Sales'
 'Renewable Energy Products' 'Health' 'Printing' 'Phone Repair'
 'Traveling Sales' 'Flowers' 'Bicycle Repair' 'Entertainment'
 'Phone Accessories' 'Hardware' 'Used Shoes' 'Music Discs & Tapes' 'Games'
 'Balut-Making' 'Textiles' 'Child Care' 'Goods Distribution' 'Florist'
 'Cobbler' 'Dental' 'Bookbinding' 'Cheese Making' 'Bicycle Sales'
 'Well digging' 'Technology' 'Musical Performance' 'Waste Management'
 'Film' 'Tourism' 'Musical Instruments' 'Religious Articles'
 'Machine Shop' 'Cleaning Services' 'Sporting Good Sales' 'Patchwork'
 'Funerals' 'Air Conditioning' 'Communications' 'Adult Care'
 'Landscaping / Gardening' 'Aquaculture' 'Beekeeping' 'Event Planning'
 'Celebrations' 'Computer' 'Personal Care Products' 'Mobile Transactions']
sector 
 ['Food' 'Transportation' 'Arts' 'Services' 'Agriculture' 'Manufacturing'
 'Wholesale' 'Retail' 'Clothing' 'Construction' 'Health' 'Education'
 'Personal Use' 'Housing' 'Entertainment']
use 
 ['To buy seasonal, fresh fruits to sell. '
 'to repair and maintain the auto rickshaw used in their business.'
 'To repair their old cycle-van and buy another one to rent out as a source of income'
 ... 'Pretend the issue with loan got addressed by Kiva Coordinator.'
 'Kiva Coordinator replaced loan use. Should see this in viewdiff.'
 'Edited loan use in english.']
country_code 
 ['PK' 'IN' 'KE' 'NI' 'SV' 'TZ' 'PH' 'PE' 'SN' 'KH' 'LR' 'VN' 'IQ' 'HN'
 'PS' 'MN' 'US' 'ML' 'CO' 'TJ' 'GT' 'EC' 'BO' 'YE' 'GH' 'SL' 'HT' 'CL'
 'JO' 'UG' 'BI' 'BF' 'TL' 'ID' 'GE' 'UA' 'XK' 'AL' 'CD' 'CR' 'SO' 'ZW'
 'CM' 'TR' 'AZ' 'DO' 'BR' 'MX' 'KG' 'AM' 'PY' 'LB' 'WS' 'IL' 'RW' 'ZM'
 'NP' 'CG' 'MZ' 'ZA' 'TG' 'BJ' 'BZ' 'SR' 'TH' 'NG' 'MR' 'VU' 'PA' 'VI'
 'VC' 'LA' 'MW' 'MM' 'MD' 'SS' 'SB' 'CN' 'EG' 'GU' 'AF' 'MG' nan 'PR' 'LS'
 'CI' 'BT']
country 
 ['Pakistan' 'India' 'Kenya' 'Nicaragua' 'El Salvador' 'Tanzania'
 'Philippines' 'Peru' 'Senegal' 'Cambodia' 'Liberia' 'Vietnam' 'Iraq'
 'Honduras' 'Palestine' 'Mongolia' 'United States' 'Mali' 'Colombia'
 'Tajikistan' 'Guatemala' 'Ecuador' 'Bolivia' 'Yemen' 'Ghana'
 'Sierra Leone' 'Haiti' 'Chile' 'Jordan' 'Uganda' 'Burundi' 'Burkina Faso'
 'Timor-Leste' 'Indonesia' 'Georgia' 'Ukraine' 'Kosovo' 'Albania'
 'The Democratic Republic of the Congo' 'Costa Rica' 'Somalia' 'Zimbabwe'
 'Cameroon' 'Turkey' 'Azerbaijan' 'Dominican Republic' 'Brazil' 'Mexico'
 'Kyrgyzstan' 'Armenia' 'Paraguay' 'Lebanon' 'Samoa' 'Israel' 'Rwanda'
 'Zambia' 'Nepal' 'Congo' 'Mozambique' 'South Africa' 'Togo' 'Benin'
 'Belize' 'Suriname' 'Thailand' 'Nigeria' 'Mauritania' 'Vanuatu' 'Panama'
 'Virgin Islands' 'Saint Vincent and the Grenadines'
 "Lao People's Democratic Republic" 'Malawi' 'Myanmar (Burma)' 'Moldova'
 'South Sudan' 'Solomon Islands' 'China' 'Egypt' 'Guam' 'Afghanistan'
 'Madagascar' 'Namibia' 'Puerto Rico' 'Lesotho' "Cote D'Ivoire" 'Bhutan']
region 
 ['Lahore' 'Maynaguri' 'Abdul Hakeem' ... 'Gbenikoro Village' 'Morimaraia'
 'alejandria']
currency 
 ['PKR' 'INR' 'KES' 'NIO' 'USD' 'TZS' 'PHP' 'PEN' 'XOF' 'LRD' 'VND' 'HNL'
 'MNT' 'COP' 'GTQ' 'TJS' 'BOB' 'YER' 'KHR' 'GHS' 'SLL' 'HTG' 'CLP' 'JOD'
 'UGX' 'BIF' 'IDR' 'GEL' 'UAH' 'EUR' 'ALL' 'CRC' 'XAF' 'TRY' 'AZN' 'DOP'
 'BRL' 'MXN' 'KGS' 'AMD' 'PYG' 'LBP' 'WST' 'ILS' 'RWF' 'ZMW' 'NPR' 'MZN'
 'ZAR' 'BZD' 'SRD' 'NGN' 'VUV' 'XCD' 'MWK' 'LAK' 'MMK' 'ZWD' 'MDL' 'SSP'
 'SBD' 'CNY' 'EGP' 'MGA' 'NAD' 'LSL' 'THB']
borrower_genders 
 ['female' 'female, female' 'female, female, female' ...
 'female, female, male, female, female, female, female, female, female, female, male, male, female, female, male, female, female, female, female, female, female, female'
 'male, female, female, female, female, female, female, female, male, male, female, male, female, male, male, male'
 'female, female, female, male, female, female, female, male, female, female, female, male, female, male, female, female, female, female, female, female, female, female, female, female, female, female, female, female, male']
repayment_interval 
 ['irregular' 'bullet' 'monthly' 'weekly']

Evaluation

At first look there are no explicit missing values. Column 'use' has extra spaces to be deleted.

In [9]:
# Deleting of extra spaces
df_input['use'].str.strip()
Out[9]:
0                    To buy seasonal, fresh fruits to sell.
1         to repair and maintain the auto rickshaw used ...
2         To repair their old cycle-van and buy another ...
3         to purchase an embroidery machine and a variet...
4                                  to purchase one buffalo.
                                ...                        
671200    [True, u'para compara: cemento, arenya y ladri...
671201    [True, u'to start a turducken farm.'] - this l...
671202                                                  NaN
671203    [True, u'to start a turducken farm.'] - this l...
671204    [True, u'to start a turducken farm.'] - this l...
Name: use, Length: 671205, dtype: object

Duplicates¶

In [10]:
df_input.head(5)
Out[10]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly
In [11]:
df_duplicates = df_input.loc[df_input.duplicated(),:]
df_duplicates

# checking some data carefully
df_duplicates.query("activity == 'Home Energy' and sector == 'Personal Use' and country == 'El Salvador'")
df_duplicates.query("sector == 'Education' and country=='Kenya' and activity == 'Higher education costs' and lender_count == 18")
Out[11]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
363782 500.0 500.0 Higher education costs Education to pay for university tuition KE Kenya Nairobi KES 31.0 18 female monthly
374737 500.0 500.0 Higher education costs Education to pay for university fees KE Kenya Nairobi KES 32.0 18 female monthly
387675 500.0 500.0 Higher education costs Education to pay for university fees KE Kenya Nairobi KES 40.0 18 female monthly
387899 500.0 500.0 Higher education costs Education to pay for university fees KE Kenya Nairobi KES 40.0 18 female monthly
387977 500.0 500.0 Higher education costs Education to pay for university fees. KE Kenya Nairobi KES 40.0 18 male monthly
408822 500.0 500.0 Higher education costs Education to pay for university fees KE Kenya Nairobi KES 40.0 18 male monthly
574734 500.0 500.0 Higher education costs Education to pay for university fees. KE Kenya Nairobi KES 40.0 18 male monthly

Evaluation of duplicates

The data has 24372 duplicated rows (3,6% of all the data). The data does not contain information on unique code of each project and unique launch date of the project, which could tell us whether we are dealing with duplicates or similar deals. Since we cannot explicitly define whether these are duplicates or identical projects committed by different people with different start time, then by deleting them we may lose a useful information.

Deсision: not to delete this data

Missing Values¶

Looking for Synonyms¶

In [12]:
df_input["borrower_genders"].unique()
# Column borrower_genders includes not only gender values, but also lists of groups of people with different 
# or similar gender  
Out[12]:
array(['female', 'female, female', 'female, female, female', ...,
       'female, female, male, female, female, female, female, female, female, female, male, male, female, female, male, female, female, female, female, female, female, female',
       'male, female, female, female, female, female, female, female, male, male, female, male, female, male, male, male',
       'female, female, female, male, female, female, female, male, female, female, female, male, female, male, female, female, female, female, female, female, female, female, female, female, female, female, female, female, male'],
      dtype=object)
In [13]:
# checking data carefully
df_input.query("borrower_genders == 'female, female, female'")
Out[13]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
17 875.0 875.0 Tailoring Services To buy a sewing machine, lace, zippers and but... PK Pakistan Lahore PKR 11.0 25 female, female, female irregular
207 800.0 800.0 Clothing Sales Clothing to invest in working capital to stock their bu... NI Nicaragua Leon NIO 9.0 7 female, female, female monthly
379 2650.0 2650.0 Clothing Sales Clothing to buy new ladies' clothes and shoes. TZ Tanzania Dar es Salaam TZS 10.0 67 female, female, female irregular
427 900.0 900.0 Farming Agriculture to pay a plowing fee and a weeding fee and to ... KH Cambodia Kampong Cham Province,Memot district KHR 13.0 32 female, female, female monthly
515 675.0 675.0 Home Products Sales Retail To buy more variety of bed sheets to sell. PK Pakistan Rawalpindi PKR 12.0 23 female, female, female irregular
... ... ... ... ... ... ... ... ... ... ... ... ... ...
670632 225.0 2850.0 Cosmetics Sales Retail to invest in her business by buying cosmetics,... NI Nicaragua Managua NIO 14.0 5 female, female, female irregular
670683 2750.0 2750.0 Clothing Sales Clothing to buy clothes to sell. NI Nicaragua Managua NIO 8.0 22 female, female, female irregular
670957 0.0 1375.0 Clothing Sales Clothing to buy and sell traditional clothing such as g... GT Guatemala Suchitepéquez GTQ 14.0 0 female, female, female monthly
670979 0.0 625.0 Clothing Sales Clothing to buy and sell traditional clothing such as g... GT Guatemala Suchitepéquez GTQ 14.0 0 female, female, female monthly
671020 225.0 700.0 Agriculture Agriculture to buy chickens to sell. GT Guatemala Suchitepéquez GTQ 14.0 6 female, female, female monthly

11676 rows × 13 columns

In [14]:
# Checking: do we have projects where funded_amount == 0 and one or more lender at the same time
# Checking: do we have projects where funded_amount > 0 and there are no lender at the same time

df_input.query('(funded_amount == 0 and lender_count > 0) or (funded_amount > 0 and lender_count == 0)').shape[0]

# We have no such projects
Out[14]:
0

Looking for Conventions¶

In [15]:
# check what fields have nan-values
df_input.isna().sum().to_frame(name='nan').query('nan > 0')
Out[15]:
nan
use 4232
country_code 8
region 56800
borrower_genders 4221
In [16]:
# calulating the number of unique values in 'use'-column
print(f"Number of unique values in 'use'-column: {len(df_input['use'].unique())}")
Number of unique values in 'use'-column: 424913

Evaluation

  • country_code: check if we are able to replace nan-values with the correct codes
  • all the data that include nan-values have object-type:

    idea#1: we are dealing with interval scale, try to change for the most common value for each sector 
            (with mode function)
    idea#2: delete the data
    
    - borrow_genders: by deleting we may lose the useful data --> choose idea#1 
    
    - use: the data contains comments to transactions. There is no sense in replacing with the most common value by 
      groups. The column contains 424913 unique values. The not nan-values can be left as comments or the column can
      be deleted.
    
    - region: the data provides an additional information about the region of the transaction within single country. 
      There is no point in replacing the nan-values with the mode-calculated value for each country. It may be far from
      truth. The not nan-values can be left as an additional information or the whole column can be deleted.

What to do with the Missing Values?¶

In [17]:
# deleting columns 'region' and 'use' to save memory and speed up our analysis
df_input = df_input.drop(columns={'use','region'})
In [18]:
# checking for country_code
df_input.query("country_code.isna()")

df_input.query("country=='Namibia' or country_code=='NAM'")
Out[18]:
funded_amount loan_amount activity sector country_code country currency term_in_months lender_count borrower_genders repayment_interval
202537 4150.0 4150.0 Wholesale Wholesale NaN Namibia NAD 6.0 162 female bullet
202823 4150.0 4150.0 Wholesale Wholesale NaN Namibia NAD 6.0 159 male bullet
344929 3325.0 3325.0 Wholesale Wholesale NaN Namibia NAD 7.0 120 female bullet
351177 3325.0 3325.0 Wholesale Wholesale NaN Namibia NAD 7.0 126 male bullet
420953 3325.0 3325.0 Wholesale Wholesale NaN Namibia NAD 7.0 118 female bullet
421218 4000.0 4000.0 Wholesale Wholesale NaN Namibia NAD 7.0 150 male bullet
487207 5100.0 5100.0 Renewable Energy Products Retail NaN Namibia NAD 7.0 183 male bullet
487653 5000.0 5000.0 Wholesale Wholesale NaN Namibia NAD 7.0 183 female bullet
In [19]:
# Namibia has ISO-Code NAM
# we have no rows with NAM-Code, but we have the projects from Namibia 
# replacing nan-values for projects from Namibia with NAM

mask = (df_input["country_code"].isnull()) & (df_input["country"]=='Namibia')
df_input.loc[mask,"country_code"] = 'NAM'

# checking the result
df_input.loc[df_input["country"]=='Namibia',:]
Out[19]:
funded_amount loan_amount activity sector country_code country currency term_in_months lender_count borrower_genders repayment_interval
202537 4150.0 4150.0 Wholesale Wholesale NAM Namibia NAD 6.0 162 female bullet
202823 4150.0 4150.0 Wholesale Wholesale NAM Namibia NAD 6.0 159 male bullet
344929 3325.0 3325.0 Wholesale Wholesale NAM Namibia NAD 7.0 120 female bullet
351177 3325.0 3325.0 Wholesale Wholesale NAM Namibia NAD 7.0 126 male bullet
420953 3325.0 3325.0 Wholesale Wholesale NAM Namibia NAD 7.0 118 female bullet
421218 4000.0 4000.0 Wholesale Wholesale NAM Namibia NAD 7.0 150 male bullet
487207 5100.0 5100.0 Renewable Energy Products Retail NAM Namibia NAD 7.0 183 male bullet
487653 5000.0 5000.0 Wholesale Wholesale NAM Namibia NAD 7.0 183 female bullet
In [20]:
# changing the values where borrower_genders has more than one gender to a 'group_of_persons' 
df_input.groupby("borrower_genders").agg({"borrower_genders":'count'})

mask = df_input["borrower_genders"].str.contains(',')
df_input.loc[mask==True, "borrower_genders"] = "group_of_persons"
In [21]:
df_input["borrower_genders"].unique()
Out[21]:
array(['female', 'group_of_persons', 'male', nan], dtype=object)
In [22]:
# look: what is the most common value for each sector?

df_grp = (df_input.groupby(by=["sector","borrower_genders"],as_index=False)
          .agg(count = ("borrower_genders","count"))
          .sort_values(by=["sector","count"],ascending=[True,False])
         )

sector_liste = df_input.query('borrower_genders.isna()')['sector'].unique()
print(sector_liste,"\n")

for sector in sector_liste:
    mode_val= df_input.loc[df_input['sector']==sector,"borrower_genders"].mode()[0]
    print(f"{sector}: {mode_val}")
['Food' 'Personal Use' 'Services' 'Agriculture' 'Housing' 'Retail'
 'Clothing' 'Arts' 'Manufacturing' 'Construction' 'Education'
 'Entertainment' 'Health' 'Transportation' 'Wholesale'] 

Food: female
Personal Use: group_of_persons
Services: female
Agriculture: female
Housing: female
Retail: female
Clothing: female
Arts: female
Manufacturing: female
Construction: male
Education: female
Entertainment: male
Health: female
Transportation: female
Wholesale: female
In [23]:
# replacing the nan-values with the most common value for each sector

for sector in sector_liste:
    mode_val= df_input.loc[df_input['sector']==sector,"borrower_genders"].mode()[0]
    
    mask = (df_input["borrower_genders"].isnull()) & (df_input['sector']==sector)
    df_input.loc[mask,"borrower_genders"] = mode_val
In [24]:
#check: do we still have nan-values in borrower_genders?
df_input.loc[df_input["borrower_genders"].isnull(),"borrower_genders"]
Out[24]:
Series([], Name: borrower_genders, dtype: object)
In [25]:
# check: what nan-values do we have?
df_input.isnull().sum()
Out[25]:
funded_amount         0
loan_amount           0
activity              0
sector                0
country_code          0
country               0
currency              0
term_in_months        0
lender_count          0
borrower_genders      0
repayment_interval    0
dtype: int64

Memory Usage Optimization¶

In [26]:
# make a copy of dataset to compare further the result of optimization
df_opti=df_input.copy()
In [27]:
# checking the maximum float values
df_input.funded_amount.max(),df_input.funded_amount.max()   
Out[27]:
(100000.0, 100000.0)
In [28]:
# can be to float32 converted (max_value for float32: 3.4E+38, max_value for float16: 65536)
df_opti[["funded_amount","loan_amount"]] = df_opti[["funded_amount","loan_amount"]].astype("float32")
df_opti.dtypes
Out[28]:
funded_amount         float32
loan_amount           float32
activity               object
sector                 object
country_code           object
country                object
currency               object
term_in_months        float64
lender_count            int64
borrower_genders       object
repayment_interval     object
dtype: object
In [29]:
# checking for maximum values to make a decision about type depreciation
df_opti["term_in_months"].max(),df_opti["lender_count"].max()
Out[29]:
(158.0, 2986)
In [30]:
# check: Does 'term_in_months' have not-whole values?
for i in df_opti["term_in_months"]:      
    if i%1 != 0:
        print("Can't be to integer converted")
print("finished")

# can be to int16 converted (max_value for int16: 32767, max_value for int8: 127)
df_opti[["term_in_months","lender_count"]]= df_opti[["term_in_months","lender_count"]].astype("int16")
df_opti.dtypes
finished
Out[30]:
funded_amount         float32
loan_amount           float32
activity               object
sector                 object
country_code           object
country                object
currency               object
term_in_months          int16
lender_count            int16
borrower_genders       object
repayment_interval     object
dtype: object
In [31]:
#check: what data can be to category-type converted?
object_liste = df_opti.select_dtypes(include='object').columns

for element in object_liste:
    print(element, len(df_opti[element].unique()))
activity 163
sector 15
country_code 87
country 87
currency 67
borrower_genders 3
repayment_interval 4
In [32]:
# change the chosen data to category type
objects_to_category = ['sector','country_code','country','currency','borrower_genders','repayment_interval','activity']

for element in objects_to_category:
    df_opti[element] = df_opti[element].astype('category')
df_opti.dtypes
Out[32]:
funded_amount          float32
loan_amount            float32
activity              category
sector                category
country_code          category
country               category
currency              category
term_in_months           int16
lender_count             int16
borrower_genders      category
repayment_interval    category
dtype: object

Data Correlation¶

In [33]:
# creating a correlation matrix
list_of_numeric_columns =df_opti.select_dtypes(include='number').columns

df_corr = df_opti[list_of_numeric_columns].corr(method="pearson")
df_corr 
Out[33]:
funded_amount loan_amount term_in_months lender_count
funded_amount 1.000000 0.945044 0.149310 0.849168
loan_amount 0.945044 1.000000 0.184795 0.798697
term_in_months 0.149310 0.184795 1.000000 0.227283
lender_count 0.849168 0.798697 0.227283 1.000000
In [34]:
# building a plot for better understanding
fig_corr =px.imshow(df_corr, x =df_corr.columns,y=df_corr.index, text_auto='.2f')

fig_corr.update_layout(title_text = f"Correlation Matrix"                  
                    ,title_font_size =20
                    ,font_size=15
                    ,title_x =0.5
                    )
fig_corr.show()

Descriptive part

The larger the number and the lighter the square, the greater the correlation between two columns (positive correlation). We can talk about the strong correlation when a number is greater than 0.7.

Analytical part

There is a strong positive correlation between the loan amount and the amount of money collected, which indicates that the larger the loan amount, the greater the amount collected. That tells us that the most of projects independent of requested amount were successfull in part or in full.

The strong positive correlation is between the loan amount and the number of creditors. The larger the loan amount, the greater the number of creditors of the project. That means that investors do not want to invest much in one project and divide risks by investing small amounts in different projects.

The correlation between loan term and funding is weak, that tells us the probability to raise funds for a project is weakly dependent on the loan term.

In [35]:
# Percent of projects received the requested amount in full to total 
percent_of_full = (df_opti.query('funded_amount == loan_amount')['funded_amount'].count()/
                        df_opti['funded_amount'].count())*100
print(f"Percentage of Projects collected the full amount: {percent_of_full:.2f}%")
Percentage of Projects collected the full amount: 92.80%

Outliers / Extreme Values¶

In [36]:
df_opti.dtypes
Out[36]:
funded_amount          float32
loan_amount            float32
activity              category
sector                category
country_code          category
country               category
currency              category
term_in_months           int16
lender_count             int16
borrower_genders      category
repayment_interval    category
dtype: object
In [37]:
# building a box-plot to look at the extreme funded amounts by sectors

fig_boxplot = px.box(df_opti
                     ,x="sector"          
                     ,y="funded_amount"
                     ,title="Box plot"
                     ,hover_name='sector'
                     ,hover_data={'sector':False}                  
                    )

fig_boxplot.add_hline(
                  y=50000, 
                  line_width=2, 
                  line_dash="dash", 
                  line_color="red")

fig_boxplot.update_yaxes(title_text='Funded Amount, in USD')
fig_boxplot.update_xaxes(title_text="Sector")

fig_boxplot.update_layout(title_font_size =20
                        ,font_size=16
                        ,title_x =0.5
                        )

fig_boxplot.show()
In [38]:
df_opti.describe()['funded_amount']
Out[38]:
count    671205.000000
mean        785.994995
std        1130.307373
min           0.000000
25%         250.000000
50%         450.000000
75%         900.000000
max      100000.000000
Name: funded_amount, dtype: float64

Descriptive part

The box-plot of each sector is placed on the scale near 0 (the median is $450 and the scale size 0-100K). The points that are above 1.5 IQR of the upper edge of a box (higher than a whisker) show the sizes of deals that we consider to be extreme.

Analytical part

The plot shows that many sectors have the deals that significantly exceed the threshold and can be interpreted as extreme sums. There is a psychological boundary on the level of $50K as a requested amount (marked by a horizontal line) that only once was exceeded.

Large amounts need to have thorough developed business-plan or minimum detailed explanation and required more time to collect the amount.

In [39]:
# deals with max amount investigation
max_amount = df_opti['funded_amount'].max()
df_opti.query("funded_amount == @max_amount")

# It was one transaction only. The deal on $100K has 2986 investors with average investment amount of ~$33,5
Out[39]:
funded_amount loan_amount activity sector country_code country currency term_in_months lender_count borrower_genders repayment_interval
70499 100000.0 100000.0 Agriculture Agriculture HT Haiti USD 75 2986 female irregular

Create New Columns/Features¶

Add new column with currency names¶

In [40]:
# at first, read a file with currency names
df_currency = pd.read_csv("currency_names.csv", sep=";", engine='python')
df_currency["Currency_Country"] = df_currency["Currency_Country"].str.title() 
df_currency.head(5)
Out[40]:
Currency_Code Currency_Name Currency_Country
0 ALL Lek Albanien
1 AMD Armenischer Dram Armenien
2 AZN Aserbaidschanischer Manat Aserbaidschan
3 BIF Burundi Franc Burundi
4 BOB Boliviano Bolivien (Plurinationaler Staat Von)
In [41]:
# create dictionary with currency names
currency_dict=dict()

for i,curr in enumerate(df_currency["Currency_Code"]):
    currency_dict[curr] = df_currency.loc[i,"Currency_Name"]

currency_dict
Out[41]:
{'ALL': 'Lek',
 'AMD': 'Armenischer Dram',
 'AZN': 'Aserbaidschanischer Manat',
 'BIF': 'Burundi Franc',
 'BOB': 'Boliviano',
 'BRL': 'Brasilianischer Real',
 'BZD': 'Belize Dollar',
 'CLP': 'Chilenischer Peso',
 'CNY': 'Yuan Renminbi',
 'COP': 'Kolumbianischer Peso',
 'CRC': 'Costa Rican Colon',
 'DOP': 'Dominikanischer Peso',
 'EGP': 'Ägyptisches Pfund',
 'EUR': 'Euro',
 'GEL': 'Lari',
 'GHS': 'Ghana Cedi',
 'GTQ': 'Quetzal',
 'HNL': 'Lempira',
 'HTG': 'Gourde',
 'IDR': 'Rupiah',
 'ILS': 'Neuer israelischer Schekel',
 'INR': 'Indische Rupie',
 'JOD': 'Jordanian Dinar',
 'KES': 'KeniaSchilling',
 'KGS': 'Som',
 'KHR': 'Riel',
 'LAK': 'Kip',
 'LBP': 'Libanesisches Pfund',
 'LRD': 'Liberianischer Dollar',
 'LSL': 'Loti',
 'MDL': 'Moldovan Leu',
 'MGA': 'Madagaskar Ariary',
 'MMK': 'Kyat',
 'MNT': 'Tugrik',
 'MWK': 'Kwacha',
 'MXN': 'Mexikanischer Peso',
 'MZN': 'Mosambik Metical',
 'NAD': 'NamibiaDollar',
 'NGN': 'Naira',
 'NIO': 'Cordoba Oro',
 'NPR': 'Nepalesische Rupie',
 'PEN': 'Nuevo Sol',
 'PHP': 'Philippinischer Peso',
 'PKR': 'Pakistanische Rupie',
 'PYG': 'Guarani',
 'RWF': 'Rwanda Franc',
 'SBD': 'SalomonenDollar',
 'SLL': 'Leone',
 'SRD': 'Surinamischer Dollar',
 'SSP': 'Südsudanesisches Pfund',
 'THB': 'Baht',
 'TJS': 'Somoni',
 'TRY': 'Türkische Lira',
 'TZS': 'Tanzanian Shilling',
 'UAH': 'Hryvnia',
 'UGX': 'UgandaSchilling',
 'USD': 'US Dollar',
 'VND': 'Dong',
 'VUV': 'Vatu',
 'WST': 'Tala',
 'XAF': 'CFA Franc BEAC',
 'XCD': 'Ostkaribischer Dollar',
 'XOF': 'CFA Franc BCEAO',
 'YER': 'Jemenitische Rial',
 'ZAR': 'Rand',
 'ZMW': 'Sambische Kwacha',
 'ZWD': 'Zimbabwean dollar'}
In [42]:
# creating an additional column to dataset
df_opti["currency_name"] = df_opti.loc[:,"currency"].map(currency_dict)
df_opti.loc[:,["currency","currency_name"]]
Out[42]:
currency currency_name
0 PKR Pakistanische Rupie
1 PKR Pakistanische Rupie
2 INR Indische Rupie
3 PKR Pakistanische Rupie
4 PKR Pakistanische Rupie
... ... ...
671200 USD US Dollar
671201 KES KeniaSchilling
671202 KES KeniaSchilling
671203 KES KeniaSchilling
671204 KES KeniaSchilling

671205 rows × 2 columns

Create a column that shows the stake of an investor in a project¶

Answer to the question: What amount invest one lender in a project?

In [43]:
df_opti["lender_stake"] = df_opti['funded_amount']/df_opti['lender_count']
In [44]:
# fill nan-values with 0 - if funded_amount == 0 --> lender_count = 0
df_opti.loc[:,"lender_stake"] = df_opti.loc[:,"lender_stake"].fillna(0)
In [45]:
# check: do we have nan values in the column?
df_opti.loc[:,"lender_stake"].isnull().sum()
Out[45]:
0

Final Adjustment of Memory Usage¶

In [46]:
df_opti["lender_stake"].max() # fits float32
Out[46]:
9475.0
In [47]:
# check: what types of data have our additional columns?
df_opti.dtypes[df_opti.dtypes.index.isin(["currency_name","lender_stake"])]
Out[47]:
currency_name    category
lender_stake      float32
dtype: object
In [48]:
# memory usage by initial data
df_input.memory_usage(deep=True)
Out[48]:
Index                  5369640
funded_amount          5369640
loan_amount            5369640
activity              46521289
sector                43603755
country_code          39601103
country               43923974
currency              40272300
term_in_months         5369640
lender_count           5369640
borrower_genders      43075427
repayment_interval    43400106
dtype: int64
In [49]:
# memory usage by optimized data
df_opti.memory_usage(deep=True)
Out[49]:
Index                 5369640
funded_amount         2684820
loan_amount           2684820
activity              1357871
sector                 672752
country_code           678443
country                679029
currency               677329
term_in_months        1342410
lender_count          1342410
borrower_genders       671510
repayment_interval     671633
currency_name          678014
lender_stake          2684820
dtype: int64

Memory Reduction Calculation¶

In [50]:
reduction = (df_input.memory_usage(deep=True).sum()-df_opti.memory_usage(deep=True).sum()) \
            /df_input.memory_usage(deep=True).sum()*100

print(f" Memory reduction: {reduction:.2f}%")
 Memory reduction: 93.22%

EDA¶

The First Stage of Analysis¶

In what currencies borrow the projects?¶

In [51]:
# look: in what currencies borrow the projects?
df_currency = (df_opti.groupby(by=["currency_name","sector","country"])
               .agg(funded_sum=("funded_amount","sum"))
               .reset_index()
               .sort_values(by=["currency_name","funded_sum"], ascending=[True,False])
              )
             
df_currency_total = (df_opti.groupby(by=["currency_name","currency"])
                   .agg(funded_sum=("funded_amount","sum"))
                   .reset_index()
                   .sort_values(by=["funded_sum"], ascending=[False])
                  )

df_data_graphic = df_currency_total.head(10)
In [52]:
title_bar = "Projects Funded Volumes by Top-10 Currencies"

bar_plot = px.bar(data_frame=df_data_graphic
                        ,x="currency_name"
                        ,y="funded_sum" 
                        ,color='funded_sum'
                        ,hover_name='currency_name'
                        ,hover_data={'currency_name':False,
                                     'funded_sum':':$.4s'}
                 )

bar_plot.update_yaxes(title_text='Funded Amount, in USD')
bar_plot.update_xaxes(title_text="Currency")

bar_plot.update_layout(title_text = title_bar                       
                    ,legend_title_text = 'Funded Amount'
                    ,title_font_size =20
                    ,font_size=16
                    ,title_x =0.5
                    )

bar_plot.show()

Descriptive part

The higher and lighter the column, the greater the amount placed in this currency.

Analytical part

The maximum falls on US Dollar, that means that greater part financed in this currency. Second and third places are occupied by national currencies for the countries where the most projects are placed.

Usually the lenders wish to place their money in so called hard currency (that is converted everywhere and has low inflation rate) and the borrowers prefer to receive money in the national currency, in which they earn or have revenue from the project.

Probably, using USD as project currency, it is a compromise among borrowers and lenders in each specific project or, as might be, a psycholgical trick of borrowers to look more serious with their proposal.

In [53]:
df_opti.columns
Out[53]:
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'country_code',
       'country', 'currency', 'term_in_months', 'lender_count',
       'borrower_genders', 'repayment_interval', 'currency_name',
       'lender_stake'],
      dtype='object')

Short look at the projects with funded amount > 10K in sector 'Personal use'¶

In [54]:
# short look at the projects with funded amount > 10K in sector 'Personal use'
df_opti.query('funded_amount>10000 and sector.str.contains("Personal")')
Out[54]:
funded_amount loan_amount activity sector country_code country currency term_in_months lender_count borrower_genders repayment_interval currency_name lender_stake
31000 10150.0 10150.0 Home Energy Personal Use KE Kenya KES 86 344 group_of_persons irregular KeniaSchilling 29.505814
63841 13425.0 13425.0 Home Energy Personal Use KE Kenya KES 86 408 group_of_persons irregular KeniaSchilling 32.904411
95132 13025.0 13025.0 Home Energy Personal Use KE Kenya KES 86 352 group_of_persons irregular KeniaSchilling 37.002842
331247 11250.0 11250.0 Home Energy Personal Use KE Kenya KES 62 376 group_of_persons irregular KeniaSchilling 29.920214

Evaluation

Deals are placed in national currency, there are no explicit repayment intervals, term is long (5-7 years), the borrowers are group of persons. The deals look of high risk, we don't have an information about the source of money to pay back the loans.

One investor' stake is bounded by $37 (recalculated amount from KES) to limit the risk.

Who are our borrowers?¶

In [55]:
#look at the borrowers' portrait. Who are they?

df_pie=(df_opti.groupby(by=["borrower_genders"],observed=True).funded_amount.agg(["count","median","mean","sum"])
       # .agg(count_of_projects = ("funded_amount","count"))
        .sort_values(by=["count"], ascending=[False])
        .reset_index()
       )
df_pie

# We have the information about gender only
Out[55]:
borrower_genders count median mean sum
0 female 430501 400.0 564.634094 243075536.0
1 male 134794 525.0 783.729675 105642056.0
2 group_of_persons 105910 1050.0 1688.662354 178846224.0
In [56]:
# building a plot to have a better look at the borrowers
values = df_pie["count"]
names = df_pie["borrower_genders"]

order = {"borrower_genders":names.tolist()}

fig_pie = px.pie(df_pie, 
             names=names, 
             values=values,
             color_discrete_sequence=["yellow","lightblue","darkblue"],
             category_orders=order,
             title= "Who are our Borrowers?",
             hover_name="borrower_genders",
             hover_data={"borrower_genders": False}
            )


fig_pie.update_traces(textposition='inside', 
                      textinfo='percent+label',
                      insidetextorientation='horizontal',
                     )

fig_pie.update_layout(legend_title_text = 'Borrowers Type'
                    ,title_font_size =20
                    ,font_size =14
                    ,title_x =0.5)

fig_pie.show()

Discriptive part

The larger and lighter the stake, the more transactions (projects) were funded by this client type with the help of our platform. In each sector, the type of client is indicated, as well as the share he has. The 'female' or 'male' means an individual of the correspondent gender and by 'group_of_persons' we understand a group of individuals of more than one person.

Analytical part

We receive the fee from each transaction, that's why it makes sense to compare the borrowers by number of deals. In addition we have strong positive correlation between sum of funded amount and number of deals that gives us an opportunity to use number of deals as key parameter.

The females occupy a share of more than 60%, males three times less and the group of persons four times less than females. It means that in developing countries where the most projects are placed men usually work for one of the companies and do not have a time or wish to look for other finance opportunities in internet*. From the other hand, it might be a psychological trick: people tend to trust woman more as a person who is usually responsible for house and children, and understand the meaning of the word 'responsibility'.

*It should be noted that in some places there is no internet at all or it's very expensive. So people buy a computer with internet access together with the others, sometimes one for many families.

Let's look in what sectors these projects were proposed.

The Second Stage of Analysis¶

In [57]:
# define top 10-sectors where the number of projects is the largest

df_bar =(df_opti.groupby(by=["sector"],observed=True).funded_amount.agg(["count","median","mean","sum"])
        .sort_values(by=["count"], ascending=[False])
        .reset_index()
       )
df_bar

list_top_sectors =df_bar["sector"].head(10)
In [58]:
df_top_sector = df_opti.query('sector.isin(@list_top_sectors)')

df_bar2=(df_top_sector.groupby(by=["borrower_genders","sector"],observed=True).funded_amount
         .agg(["count","median","mean","sum"])
         .sort_values(by=["borrower_genders","count"], ascending=[True,False])
         .reset_index()
        )
df_bar2.head(5)
Out[58]:
borrower_genders sector count median mean sum
0 female Agriculture 100683 400.0 576.697388 58063624.0
1 female Food 99670 350.0 494.818695 49318580.0
2 female Retail 97269 350.0 488.781006 47543240.0
3 female Services 30833 425.0 707.392212 21811024.0
4 female Housing 23184 475.0 560.115356 12985715.0
In [59]:
# build a plot of top 10 sectors by borrowing of different type of project makers
title_fig_bar = "Top 10 Sectors by Borrowing"
color_map={'female':'yellow','male':'lightblue','group_of_persons':'darkblue'}

fig_bar = px.bar(data_frame=df_bar2
                        ,x="sector"
                        ,y="count" 
                        ,color="borrower_genders"
                        ,color_discrete_map=color_map
                        ,hover_name='sector'
                        ,hover_data={'sector':False,
                                     'count':':.3s',
                                      'sum':':$.3s',
                                      'median':':$.1f'
                                    }
                         ,category_orders={"borrower_genders":['female','male','group_of_persons']}
                         ,barmode='group'
                 )

fig_bar.update_layout(title_text = title_fig_bar                       
                    ,legend_title_text = 'Borrowers Type'
                    ,title_font_size =20
                    ,font_size=16
                    ,title_x =0.5
                    )

fig_bar.update_yaxes(title_text='Number of Projects')
fig_bar.update_xaxes(title_text="Sector")

fig_bar.show()

Discriptive part

The higher the column, the greater the number of projects were invested by lenders. In yellow reflected the projects proposed by females, in lightblue - by males and in darkblue - by group of individuals.

The graph shows the breakdown of projects by sectors and borrowers types. For analysis we use top 10 sectors by deals as the sectors are of most important.

Analytical part

The graph shows that the main projects placed in the next sectors: Agriculture, Food and Retail. It is typical for non-industrialized developing countries, where these sectors play usually significant role. The main project-maker in these sectors are females by the reasons we have described earlier.

Let's look further in what countries the proposed projects plan to be realized/realizing or have been realized (we do not have information when the projects are realizing).

Assumptions: We consider that the project-maker realizes or have realized the project in his own country. We have no information how successful this realization was and what the lenders get back.

The Third Stage of Analysis¶

In [60]:
df_top_sector.head(5)
Out[60]:
funded_amount loan_amount activity sector country_code country currency term_in_months lender_count borrower_genders repayment_interval currency_name lender_stake
0 300.0 300.0 Fruits & Vegetables Food PK Pakistan PKR 12 12 female irregular Pakistanische Rupie 25.00000
1 575.0 575.0 Rickshaw Transportation PK Pakistan PKR 11 14 group_of_persons irregular Pakistanische Rupie 41.07143
2 150.0 150.0 Transportation Transportation IN India INR 43 6 female bullet Indische Rupie 25.00000
3 200.0 200.0 Embroidery Arts PK Pakistan PKR 11 8 female irregular Pakistanische Rupie 25.00000
4 400.0 400.0 Milk Sales Food PK Pakistan PKR 14 16 female monthly Pakistanische Rupie 25.00000
In [61]:
df_bar3=(df_top_sector.groupby(by=["borrower_genders","sector","country"],observed=True).funded_amount
         .agg(["count","median","mean","sum"])
         .sort_values(by=["borrower_genders","sector","count"], ascending=[True,True,False])
         .reset_index()
       )
In [62]:
# create function to remove unused categories to make cuts from our dataset
def remove_cats(df):
    for col in df.columns:
        if df[col].dtype.name == 'category': 
            df[col] = df[col].cat.remove_unused_categories()
    return df
        
# limit the threshold of output data: > 3000   - to make the graph readable
df_bar4 = df_bar3.query("count > 3000").copy()

# remove unused part of categories
df_bar4=remove_cats(df_bar4)
In [63]:
# rename the columns names for better hover data understanding  
df_bar4 = df_bar4.rename(columns={'count':'Number_of_Projects', 'median': 'Median','sum': 'Total_Amount', \
                                 'borrower_genders':'Borrower_Type','sector':'Sector'})
In [64]:
title_scatter = 'Top Countries by Borrowing with Number of Projects > 3000'

scatter_plot2 = px.scatter(data_frame=df_bar4
                        ,x="country"
                        ,y="Sector"
                        ,color="Sector"
                        ,symbol="Borrower_Type"
                        ,size="Number_of_Projects"
                        ,hover_name='country'
                        ,hover_data={'country':False,
                                     'Number_of_Projects':':,.3s',
                                      "Total_Amount":':$,.3s',
                                      "Median":':$.1f'
                                    }          
                          )

scatter_plot2.update_xaxes(title_text="Countries")
scatter_plot2.update_yaxes(title_text="Sector of Economics")

scatter_plot2.update_layout(title_text = title_scatter                       
                    ,legend_title_text = 'Sector, Borrower Type'
                    ,title_font_size =20
                    ,font_size=14
                    ,title_x =0.5
                    )

scatter_plot2.show()

Descriptive part The graph shows the breakdown of projects by countries, sector, borrower types, number of projects. Hover data displays information of total funded sum and median for this category additionally.

By X-axis there are countries, by Y-axes - sectors. The sectors are represented in different colors, borrower types - in different symbols. For example, the Agriculture sector has blue color, the female borrower type is circle.

The number of projects is shown by the size of symbol. The larger the symbol, the more number of projects is.

Analytical part

The main project-makers are located in Philippines, Kenya and Cambodia in the sectors of Agriculture, Food, Retail and Personal Use. The projects are placed in the sectors that ususally play significant role in these countries.

Let's make more closely look at economic indicators of the borrowers countries to understand their correlation with collected investments.

Comparison of projects' funded amounts with economic indicators¶

In [65]:
# calculate the total number of projects and total invested amount by the chosen countries
liste = df_bar4.country.unique()
df_countries_liste = df_opti.query('country.isin(@liste)')

df_countries=(df_countries_liste.groupby(by='country',observed=True)["funded_amount"]
              .agg(["count","sum","median","mean"])
              .sort_values(by='sum', ascending=False)
              .reset_index()
             )
# set readable data format
pd.set_option('display.float_format', '{:,.2f}'.format)

#list of four countries that have the most share of projects 
chosen=["Philippines","Kenya","Cambodia","El Salvador"]

#adding the data for GDP per Person and Population from internet (the data is for 2021)
gdp_pro_capita = [3460.53, 2081.8,4551.2,1625.24]
population = [113.9,53.01,6.314,16.59]

df_indicators = df_countries.query('country == @chosen').reset_index(drop=True).copy()
df_indicators['gdp_pro_capita']= gdp_pro_capita
df_indicators['population']= population
df_indicators['population']=df_indicators['population']*1000000

# adding ratios
df_indicators['population_to_count']=df_indicators['population']/df_indicators['count']
df_indicators['gdp_pro_capita_to_mean']=df_indicators['gdp_pro_capita']/df_indicators['mean']
df_indicators['months_to_work']=12/df_indicators['gdp_pro_capita_to_mean']
df_indicators= df_indicators.sort_values(by="country").reset_index(drop=True)
In [66]:
df_indicators
Out[66]:
country count sum median mean gdp_pro_capita population population_to_count gdp_pro_capita_to_mean months_to_work
0 Cambodia 34836 18,817,100.00 275.00 540.16 1,625.24 16,590,000.00 476.23 3.01 3.99
1 El Salvador 39875 23,357,724.00 500.00 585.77 4,551.20 6,314,000.00 158.34 7.77 1.54
2 Kenya 75825 32,248,404.00 300.00 425.30 2,081.80 53,010,000.00 699.11 4.89 2.45
3 Philippines 160441 54,476,376.00 275.00 339.54 3,460.53 113,900,000.00 709.92 10.19 1.18
In [67]:
# adding average term of loans 
df_t =(df_opti.query('country.isin(@chosen)').groupby(by="country",observed=True).term_in_months
               .agg(["median","mean"])
               .reset_index()
               .sort_values(by="country")
               .reset_index(drop=True)
      )
df_t
Out[67]:
country median mean
0 Cambodia 13.00 14.11
1 El Salvador 14.00 15.69
2 Kenya 13.00 12.63
3 Philippines 8.00 9.88
In [68]:
# adding the columns with terms' info to our table  
df_indicators[["average_term","median_term"]]=df_t[["mean","median"]]

# adding new ratios
df_indicators["av_monthly_pmnt_to_gdp_pro_cap"] =(
            (df_indicators["mean"]/df_indicators["average_term"])/(df_indicators["gdp_pro_capita"]/12)
            )

df_indicators["median_monthly_pmnt_to_gdp_pro_cap"] =(
            (df_indicators["median"]/df_indicators["median_term"])/(df_indicators["gdp_pro_capita"]/12)
            )
df_indicators
Out[68]:
country count sum median mean gdp_pro_capita population population_to_count gdp_pro_capita_to_mean months_to_work average_term median_term av_monthly_pmnt_to_gdp_pro_cap median_monthly_pmnt_to_gdp_pro_cap
0 Cambodia 34836 18,817,100.00 275.00 540.16 1,625.24 16,590,000.00 476.23 3.01 3.99 14.11 13.00 0.28 0.16
1 El Salvador 39875 23,357,724.00 500.00 585.77 4,551.20 6,314,000.00 158.34 7.77 1.54 15.69 14.00 0.10 0.09
2 Kenya 75825 32,248,404.00 300.00 425.30 2,081.80 53,010,000.00 699.11 4.89 2.45 12.63 13.00 0.19 0.13
3 Philippines 160441 54,476,376.00 275.00 339.54 3,460.53 113,900,000.00 709.92 10.19 1.18 9.88 8.00 0.12 0.12
In [69]:
#highlihting the values
def highlighter(cell_value):
    value1 = df_indicators['av_monthly_pmnt_to_gdp_pro_cap'].max()
    value2 = df_indicators['av_monthly_pmnt_to_gdp_pro_cap'].nlargest(2).tolist()[1]
    if cell_value in [value1,value2]:
        return "background-color: red"

df_indicators.style.applymap(highlighter)
Out[69]:
  country count sum median mean gdp_pro_capita population population_to_count gdp_pro_capita_to_mean months_to_work average_term median_term av_monthly_pmnt_to_gdp_pro_cap median_monthly_pmnt_to_gdp_pro_cap
0 Cambodia 34836 18817100.000000 275.000000 540.162476 1625.240000 16590000.000000 476.231485 3.008798 3.988303 14.110690 13.000000 0.282644 0.156190
1 El Salvador 39875 23357724.000000 500.000000 585.773621 4551.200000 6314000.000000 158.344828 7.769554 1.544490 15.688677 14.000000 0.098446 0.094167
2 Kenya 75825 32248404.000000 300.000000 425.300415 2081.800000 53010000.000000 699.109792 4.894893 2.451535 12.630373 13.000000 0.194098 0.133021
3 Philippines 160441 54476376.000000 275.000000 339.541504 3460.530000 113900000.000000 709.918288 10.191773 1.177420 9.881969 8.000000 0.119148 0.119201

Evaluation

Assumptions: We accept 'GDP pro capita' as an estimation of the person income in the country per year.

The 'population_to_count' shows how many people in the country receive funding through the platform. For example, each 158th resident of El Salvador proposed a project and received a financing. If we have the same ratio per resident in Germany, we would get the number of 526582 projects (taking into account the population of Germany of 83.2M).

The 'gdp_pro_capita_to_mean' shows how the year' GDP per person in the country correlates with the average amount of the project. The higher this ratio, the better is creditworthiness in a country.

For example, this ratio for Cambodia is equal ~3. That means that a resident of Cambodia has to work 4 months to repay the principal of the project only (please look at figures in 'months_to_work'. Here we do not take into account the project' term).

The 'av_monthly_pmnt_to_gdp_pro_cap' shows what part of monthly income the borrower has to pay to fullfil his obligations under the project without interest (based on average term and project size). The lower this ratio, the better is creditworthiness in a country.

For example, in Cambodia this ratio is more than 25%, that means the average project-maker has to pay monthly more than a quarter of his income for the loan.

The 'median_monthly_pmnt_to_gdp_pro_cap' shows what part of monthly income the borrower has to pay to fullfil his obligations under the project without interest (based on median term and project size).

The projects look risky especially for Kenya and Cambodia. The platform may think to provide the potential lenders with reports on the risks of investing in a particular country, taking into account historical data. However, it's clear that the solution is, of course, in the hands of the platform' management.

Analysis of lenders' participation¶

In [70]:
# max participation of one lender in one project 
max_stake =df_opti['lender_stake'].max()

df_opti.query('lender_stake == @max_stake')
Out[70]:
funded_amount loan_amount activity sector country_code country currency term_in_months lender_count borrower_genders repayment_interval currency_name lender_stake
237835 9,475.00 9,475.00 Used Clothing Clothing CD The Democratic Republic of the Congo USD 6 1 group_of_persons irregular US Dollar 9,475.00
In [71]:
# look closely at the investments > $1000
print("One Lender Stake, in USD:\n")

print(f"Median (all the data):{df_opti.lender_stake.median(): .2f}, mean:{df_opti.lender_stake.mean(): .2f}")

print(f"Median (lender stake>=1000, funded amount>5000'): \
{df_opti.query('lender_stake>=1000 and funded_amount>5000').lender_stake.median(): .2f}, \
{df_opti.query('lender_stake>=1000 and funded_amount>5000').lender_stake.mean(): .2f}")

print(f"Median (lender stake>=1000, funded amount>10000'): \
{df_opti.query('lender_stake>=1000 and funded_amount>10000').lender_stake.median(): .2f}, \
{df_opti.query('lender_stake>=1000 and funded_amount>10000').lender_stake.mean(): .2f}")
One Lender Stake, in USD:

Median (all the data): 31.25, mean: 60.95
Median (lender stake>=1000, funded amount>5000'):  2575.00,  2990.52
Median (lender stake>=1000, funded amount>10000'):  nan,  nan

Evaluation

In general, the behavior of investors looks cautious: the average participation in one transaction looks insignificant. Only in the deals' range from USD 5K to USD 10K, the average participation grows to USD 3K. In the range of transactions from USD 10K, this is no longer observed.

Creating Dash¶

In [72]:
# create data for dash plot
df_dash =(df_opti.groupby(by=["sector","country"],observed=True)['funded_amount'].agg(["count","median","mean","sum"])
        .sort_values(by=["sector","count"], ascending=[True,False])
        .reset_index()
       )
In [73]:
# import dash libraries
from jupyter_dash import JupyterDash
from dash import dcc

from dash import html

from dash import Dash
from dash.dependencies import Input, Output
In [74]:
options_list = sorted(df_bar4['Sector'].unique())
radio_items = df_dash.columns[2:]
#radio_items

options_dict=dict()
options_outputs = ["Number of Projects","Median Value, USD", "Mean Value, USD", "Total Amount Invested, USD"]

for i, element in enumerate(radio_items):
    options_dict[element]=options_outputs[i]
In [75]:
# 1. Create Application

my_dash = JupyterDash(__name__)

# 2. set Layout
                                          
my_dash.layout = html.Div(children = [
                            html.H4(children ="Please choose sector and function to build a plot:",
                                   style={"textAlign":"left",
                                         'font-family': 'Arial',
                                         'font-size': '14px',
                                         'font-style': 'italic'}  
                                    ),
    
                            dcc.Dropdown(id="id_sector",
                                                    options=options_list,
                                                    value=options_list[0],
                                                    style={"width":"50%",
                                                           'font-family': 'Arial',
                                                           'font-size': '14px'
                                                          }
                                                    
                                                   ),
    
                            dcc.RadioItems(id="id_radio",
                                         options = radio_items, 
                                         value = radio_items[0],
                                         inline=True,
                                         style={"textAlign":"center",
                                                'font-family': 'Arial',
                                                'font-size': '14px'} 
                                    ),
                            html.Br(style={"line-height": 15}),
    
                            dcc.Graph(id="id_graph",
                                      figure={}
                                    )
  
                            ])
# 3. Create Callback 

# 3.1 Link with Layout Elements

@my_dash.callback(
                 Output(component_id="id_graph", component_property="figure"),               
                 Input(component_id="id_radio", component_property="value"),
                 Input(component_id="id_sector", component_property="value")
          )

# 3.2 Create Update-Funktion: receives from Input, returns in Output

def update_graph(sort_value,sector):  
    df_plot = (df_dash.query('sector==@sector')
               .sort_values(by=sort_value, ascending=False)
               .reset_index(drop=True).head(10)
              )

    df_plot=remove_cats(df_plot)
    
    title_dash_bar = f"Top 10 Countries by {options_dict[sort_value].rstrip(', USD')} in Sector: '{sector}'"
    
    fig_bar1 = px.bar(data_frame=df_plot
                        ,x="country"
                        ,y=sort_value      
                        ,hover_name='country'
                        ,hover_data={'country':False,
                                      'count':True,
                                      'sum': ':$,.3s',
                                      'median':':$,.1f',
                                      'mean': ':$,.1f'
                                   }
                 )

    fig_bar1.update_layout(title_text = title_dash_bar                       
                    ,title_font_size =20
                    ,font_size=16
                    ,title_x =0.5
                    ,height=500  
                    )
    
    fig_bar1.update_yaxes(title_text=options_dict[sort_value]
                         )
    fig_bar1.update_xaxes(title_text="Country"
                         )

    return fig_bar1
    
# 4. App starten

# my_dash.run_server(mode="external", port=8099)
my_dash.run_server(mode="inline",port=8099)
Dash is running on http://127.0.0.1:8099/

Creating a map with Top Countries¶

In [76]:
#read the file with coordinates
df_map = pd.read_csv("country_projects_coordinates.csv", sep=',',index_col=0)
df_map.head(2)
Out[76]:
Country Capital Coordinates count sum median mean
0 Armenia Jerewan 40.18799033055038; 44.516029958561965 8631 11,186,675.00 1,150.00 1,296.10
1 Cambodia Phnom Penh 11.564518491643906; 104.89841106463936 34836 18,817,100.00 275.00 540.16
In [77]:
# preparation the data
# splitting the column Coordinates
df_map[['lat','long']] = df_map['Coordinates'].str.split(';',expand=True)
In [78]:
# adding rank and size of circle to show the Number of Projects for each country
df_map['rank']=df_map['count'].rank(method='dense', ascending=False).astype(int)
df_map['radius']=round((df_map['count']/10000 *1.2),0)
In [79]:
# builing a map
import folium
from folium.plugins import MarkerCluster

df_map['name_for_map'] = (df_map['Country']
                        + '<br><br>'
                        +'Number of Projects: '
                        + df_map['count'].astype('str')
                        + '<br>'
                        +'Rank: '
                        + df_map['rank'].astype('str')
                         )

my_map = folium.Map(location=[df_map['lat'][0],df_map['long'][0]], zoom_start=2)

df_map.apply(lambda row:folium.CircleMarker(location=[row['lat'],row['long']], 
                                            radius=[row['radius']],
                                            color='blue')
                         .add_to(my_map), 
                          axis=1
            )

df_map.apply(lambda row:folium.Marker(location=[row['lat'],row['long']],
                                      icon=folium.Icon(color="blue"),
                                      popup=row['name_for_map'])
                         .add_to(my_map), 
                          axis=1
            )

title_html = '''
             <h3 align="center" style="font-size:20px"><b>Top Countries by Number of Projects</b></h3>
             '''
my_map.get_root().html.add_child(folium.Element(title_html))

my_map
Out[79]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Descriptive part

The top countries that have more than 3000 projects in portfolio are shown on the map. The markers set to geographical coordinates of the capital of each country. The blue circle reflects the number of deals in each project. The larger the circle, the more projects are financed through the platform.

Analytical part

The most part of countries represented here belong to the poor category of countries with year's GDP per capita (Gross Domestic Product per person) less than USD 5K with low creditworthiness and payability.

*The data that was used in analysis before is for 2021. The picture below from Wikipedia shows the current data for 2023. It should be noted that in the last two years some countries were shifted to the next category: USD 5K - 10K, for example, Philippines.

It may be useful to make a detailed look at the countries in higher 'GDP per capita' categories (please see the picture below) and to promote the platform there. However, it's clear that the solution is, of course, in the hands of the platform' management.

picture

In [ ]: